Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move tarantoolctl to test-run tool submodule and add replication_sync_timeout to it #242

Merged
merged 3 commits into from
Dec 5, 2020

Conversation

avtikhon
Copy link
Contributor

@avtikhon avtikhon commented Nov 24, 2020

  1. Move tarantoolctl to test-run tool submodule

    Moved tarantoolctl to test-run tool submodule repository as:

    /test-run/.tarantoolctl

    Also set backward compability to use old path location:

    /test/.tarantoolctl

    as its primary place.

    To set PWD environment for tarantoolctl for popen() call tried to use
    os.environ. It was needed for the later commits where tarantoolctl
    would be moved from to test-run repository and its configuration file
    would be read right from the working directory of the test. In this
    case all os.putenv() calls changed to environment set with os.environ,
    to be able to modify environment directly, check [1] ("Note: Calling
    putenv() directly does not change os.environ"). But in this case all
    the tests had to be changed to use os.environ instead of os.putenv(),
    it will block the use of os.putenv() in the future tests. Decided to
    avoid of this way and temporary change 'PWD' around the Popen() call.

    Needed for test: flaky hang tests tarantool#5504
    Part of Simplify configuration needed for setup tap / tarantool tests #78

    [1] - https://docs.python.org/2/library/os.html#process-parameters

  2. Setup replication_sync_timeout at .tarantoolctl

    Found that tests may fail due to hang in seek_wait() loop on starting
    and stopping instances. It happened, because instances were not synced
    within default output timeout which is by default 120 seconds, while
    replication sync happens only each 300 seconds by default. To fix it
    replication_sync_timeout should be decreased to some value lower than
    'no-output-timeout', so decided to set it to 100 seconds.

    The issue looked like in tests:

      --- replication/gh-5140-qsync-casc-rollback.result    Mon Oct 19
      17:29:46 2020
      +++ /rw_bins/test/var/070_replication/gh-5140-qsync-casc-rollback.result
      Mon Oct 19 17:31:35 2020
      @@ -169,56 +169,3 @@
       -- all the records are replayed one be one without yields for WAL writes, and
       -- nothing should change.
       test_run:cmd('restart server default')
      - |
      -test_run:cmd('restart server replica')
      - | ---
      - | - true
      - | ...
      -
    

    Part of test: flaky hang tests tarantool#5504

@Totktonada
Copy link
Member

Please, mark the first commit as part of #78.

@Totktonada
Copy link
Member

Please, also mention the issue in tarantool that motivates those changes.

@avtikhon avtikhon force-pushed the avtikhon/move_tarantoolctl_config branch from 1638eaa to 4320eff Compare November 30, 2020 06:34
@avtikhon avtikhon requested a review from Totktonada November 30, 2020 06:37
@avtikhon
Copy link
Contributor Author

Please, also mention the issue in tarantool that motivates those changes.

Made all the changes as you suggested, please review.

@avtikhon avtikhon force-pushed the avtikhon/move_tarantoolctl_config branch 2 times, most recently from e4fc9fb to 944f987 Compare November 30, 2020 07:03
@avtikhon avtikhon force-pushed the avtikhon/move_tarantoolctl_config branch 2 times, most recently from b190e0e to 8928998 Compare December 3, 2020 06:32
@Totktonada Totktonada force-pushed the avtikhon/move_tarantoolctl_config branch from 8928998 to 3a48123 Compare December 5, 2020 00:36
@Totktonada
Copy link
Member

Made several changes:

  1. Moved the comment re .tarantoolctl file search into the appropriate commit.
  2. Renamed repl_sync_timeout to replication_sync_timeout. There is no need to abbreviate it.
  3. Extracted PWD tweaking into its own commit, because it deserves its own explanation and may be considered as the bug fix on its own. Used os.environ here, because why not. Always set PWD to actual cwd with os.getcwd(), because, again, why not.
  4. Fixed misusage of 'tarantoolctl' word, where '.tarantoolctl' configuration file was meant.
  5. Rebased on top of the current master.

Copy link
Member

@Totktonada Totktonada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some tweaks and now everything looks okay for me. I think we both spin around here enough time and the change can be pushed. I verified in-source and out-of-source builds before and after the corresponding tarantool patch and everything looks okay too.

@Totktonada
Copy link
Member

Found a nice side effect: now we can run ./test/test-run.py and cd test && ./test-run.py is not mandatory anymore. I'll add a note about this into the commit message.

@Totktonada Totktonada force-pushed the avtikhon/move_tarantoolctl_config branch from 3a48123 to 976c836 Compare December 5, 2020 01:06
Totktonada and others added 3 commits December 6, 2020 02:46
test-run copies .tarantooctl file from a test/ directory to a worker's
working directory (e.g. test/var/001_app). It would be quite intuitive
if tarantoolctl would use this copy when operating on tarantool
instances of the worker. Suprisingly, it is not so.

tarantoolctl checks whether the file exists inside a directory pointed
by the PWD environment variable, not a real current working directory.
test-run did not set PWD prior to this commit, however everything work.
Why? Because test-run.py is always executed from the test/ directory,
which contains the same .tarantoolctl file. Even `make test` / `make
test-force` targets issue commands like `cd <...>/test &&
<...>/test-run.py <...>` (where `cd` sets PWD).

The time has come. Let's set PWD and use test/var/001_app/.tarantoolctl
configuration for instances of the 001_app worker.

Side note: it would be good to fix tarantoolctl too to use a real cwd,
not PWD.

This inconsistency was spotted, when we perform attempt to remove the
test/.tarantoolctl config from the main tarantool repository, move it to
the test-run repository, but keep copying into a worker directory.

As a side effect, test-run now could be invoked without manual changing
a current directory to test/. So, where we run a command like this
(example for in-source build):

 | (cd test && ./test-run.py)

The following command can be used:

 | ./test/test-run.py

Nice!

Sadly, the trick `Popen(<...>, env=dict(os.environ, PWD=self.vardir))`
does not work for us, because we use os.putenv() in test-run and Python
tests. os.putenv() does not update os.environ, so we would feed
non-actual environment to the subprocess. Because of this, we just set
PWD before and after Popen() call.

Part of tarantool/tarantool#5504
Part of #78

Co-authored-by: Alexander V. Tikhonov <[email protected]>
Moved .tarantoolctl to test-run tool submodule repository as:

  <tarantool repository>/test-run/.tarantoolctl

Also set backward compability to use old path location:

  <tarantool repository>/test/.tarantoolctl

as its primary place.

Needed for tarantool/tarantool#5504
Part of #78
Found that tests may fail due to hang in seek_wait() loop on starting
and stopping instances. It happened, because instances were not synced
within default output timeout which is by default 120 seconds, while
replication sync happens only each 300 seconds by default. To fix it
replication_sync_timeout should be decreased to some value lower than
'no-output-timeout', so decided to set it to 100 seconds.

The issue looked like in tests:

  --- replication/gh-5140-qsync-casc-rollback.result    Mon Oct 19
  17:29:46 2020
  +++ /rw_bins/test/var/070_replication/gh-5140-qsync-casc-rollback.result
  Mon Oct 19 17:31:35 2020
  @@ -169,56 +169,3 @@
   -- all the records are replayed one be one without yields for WAL writes, and
   -- nothing should change.
   test_run:cmd('restart server default')
  - |
  -test_run:cmd('restart server replica')
  - | ---
  - | - true
  - | ...
  -

Part of tarantool/tarantool#5504
@Totktonada Totktonada force-pushed the avtikhon/move_tarantoolctl_config branch from 976c836 to d36db6c Compare December 5, 2020 23:47
@Totktonada
Copy link
Member

(Regarding the last force-push.) Fixed a typo in a commit message.

@Totktonada Totktonada merged commit e843552 into master Dec 5, 2020
@Totktonada Totktonada deleted the avtikhon/move_tarantoolctl_config branch December 5, 2020 23:49
Totktonada added a commit to tarantool/tarantool that referenced this pull request Dec 6, 2020
See commits in the PR [1] for detailed description of the changes. User
visible changes are the following.

1. Now test-run.py can be invoked from any directory without changing a
   current working directory to `test/`.
2. The `test/.tarantoolctl` configuration file is not mandatory and can
   be removed. It is shipped now within the test-run repository.
3. test-run sets the `replication_sync_timeout` box.cfg() option when
   the `test/.tarantoolctl` is not present in a parent repository. The
   value is controlled by the --replication-sync-timeout argument and
   defaults to 100 seconds (unlike tarantool's default, which is 300
   seconds).

The reason of the changes is to set default `replication_sync_timeout`
for all tests to a value lower than `--no-output-timeout` (120 seconds)
to allow instances to step into the orphan mode before this deadline and
see more descriptive picture when it leads to failure of a test. What is
also important, when a test fails before the `--no-output-timeout`, we
able to restart it based on the `fragile` suite.ini option and / or
collect artifacts to store them in CI.

The `--no-output-timeout` deadline remains the show-stopper. We'll
introduce a test execution timeout later to step into the general
`--no-output-timeout` only in quite rare and unusual cases.

The next commit will actually remove `test/.tarantoolctl`, so the new
`replication_sync_timeout` will be in effect.

[1]: tarantool/test-run#242

Part of #5504
Totktonada added a commit to tarantool/tarantool that referenced this pull request Dec 6, 2020
See commits in the PR [1] for detailed description of the changes. User
visible changes are the following.

1. Now test-run.py can be invoked from any directory without changing a
   current working directory to `test/`.
2. The `test/.tarantoolctl` configuration file is not mandatory and can
   be removed. It is shipped now within the test-run repository.
3. test-run sets the `replication_sync_timeout` box.cfg() option when
   the `test/.tarantoolctl` is not present in a parent repository. The
   value is controlled by the --replication-sync-timeout argument and
   defaults to 100 seconds (unlike tarantool's default, which is 300
   seconds).

The reason of the changes is to set default `replication_sync_timeout`
for all tests to a value lower than `--no-output-timeout` (120 seconds)
to allow instances to step into the orphan mode before this deadline and
see more descriptive picture when it leads to failure of a test. What is
also important, when a test fails before the `--no-output-timeout`, we
able to restart it based on the `fragile` suite.ini option and / or
collect artifacts to store them in CI.

The `--no-output-timeout` deadline remains the show-stopper. We'll
introduce a test execution timeout later to step into the general
`--no-output-timeout` only in quite rare and unusual cases.

The next commit will actually remove `test/.tarantoolctl`, so the new
`replication_sync_timeout` will be in effect.

[1]: tarantool/test-run#242

Part of #5504

(cherry picked from commit 6c8efa8)
Totktonada added a commit to tarantool/tarantool that referenced this pull request Dec 6, 2020
See commits in the PR [1] for detailed description of the changes. User
visible changes are the following.

1. Now test-run.py can be invoked from any directory without changing a
   current working directory to `test/`.
2. The `test/.tarantoolctl` configuration file is not mandatory and can
   be removed. It is shipped now within the test-run repository.
3. test-run sets the `replication_sync_timeout` box.cfg() option when
   the `test/.tarantoolctl` is not present in a parent repository. The
   value is controlled by the --replication-sync-timeout argument and
   defaults to 100 seconds (unlike tarantool's default, which is 300
   seconds).

The reason of the changes is to set default `replication_sync_timeout`
for all tests to a value lower than `--no-output-timeout` (120 seconds)
to allow instances to step into the orphan mode before this deadline and
see more descriptive picture when it leads to failure of a test. What is
also important, when a test fails before the `--no-output-timeout`, we
able to restart it based on the `fragile` suite.ini option and / or
collect artifacts to store them in CI.

The `--no-output-timeout` deadline remains the show-stopper. We'll
introduce a test execution timeout later to step into the general
`--no-output-timeout` only in quite rare and unusual cases.

The next commit will actually remove `test/.tarantoolctl`, so the new
`replication_sync_timeout` will be in effect.

[1]: tarantool/test-run#242

Part of #5504

(cherry picked from commit 6c8efa8)
Totktonada added a commit to tarantool/tarantool that referenced this pull request Dec 6, 2020
See commits in the PR [1] for detailed description of the changes. User
visible changes are the following.

1. Now test-run.py can be invoked from any directory without changing a
   current working directory to `test/`.
2. The `test/.tarantoolctl` configuration file is not mandatory and can
   be removed. It is shipped now within the test-run repository.
3. test-run sets the `replication_sync_timeout` box.cfg() option when
   the `test/.tarantoolctl` is not present in a parent repository. The
   value is controlled by the --replication-sync-timeout argument and
   defaults to 100 seconds (unlike tarantool's default, which is 300
   seconds).

The reason of the changes is to set default `replication_sync_timeout`
for all tests to a value lower than `--no-output-timeout` (120 seconds)
to allow instances to step into the orphan mode before this deadline and
see more descriptive picture when it leads to failure of a test. What is
also important, when a test fails before the `--no-output-timeout`, we
able to restart it based on the `fragile` suite.ini option and / or
collect artifacts to store them in CI.

The `--no-output-timeout` deadline remains the show-stopper. We'll
introduce a test execution timeout later to step into the general
`--no-output-timeout` only in quite rare and unusual cases.

The next commit will actually remove `test/.tarantoolctl`, so the new
`replication_sync_timeout` will be in effect.

[1]: tarantool/test-run#242

Part of #5504

(cherry picked from commit 6c8efa8)
@Totktonada
Copy link
Member

Updated the test-run submodule in tarantool in the following commits: 2.7.0-97-g6c8efa849, 2.6.1-84-g0d738037b, 2.5.2-48-gfa36857a3, 1.10.8-33-g64e2fb849.

@Totktonada
Copy link
Member

Removed the test/.tarantoolctl file from the tarantool repository in the following commits: 2.7.0-98-g54667f6fa, 2.6.1-85-g8d4623652, 2.5.2-49-gabd4660df, 1.10.8-34-gb5d31b60a. After this the new --replication-sync-timeout option (and its default) will be in effect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants